New York Taxi Fare Dataset¶

In this notebook we will study the NY Taxi Fare Dataset (click on it to see the Kaggle page). We will start with data exploration, followed by feature engineering, data cleaning, and visualization, and finally test different models to predict the fare price.

Disclaimer: while running this notebook I ran into problems due to the size of the dataset, so I decided to analyze only 100,000 rows of it, and later only the year with the highest traffic.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly_express as px
import seaborn as sns
import folium

palette = sns.color_palette("rainbow", 8)

1. Data Exploration¶

In [2]:
df = pd.read_csv('./data/train.csv', nrows= 100000)

df.head(10)
Out[2]:
key fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
0 2009-06-15 17:26:21.0000001 4.5 2009-06-15 17:26:21 UTC -73.844311 40.721319 -73.841610 40.712278 1
1 2010-01-05 16:52:16.0000002 16.9 2010-01-05 16:52:16 UTC -74.016048 40.711303 -73.979268 40.782004 1
2 2011-08-18 00:35:00.00000049 5.7 2011-08-18 00:35:00 UTC -73.982738 40.761270 -73.991242 40.750562 2
3 2012-04-21 04:30:42.0000001 7.7 2012-04-21 04:30:42 UTC -73.987130 40.733143 -73.991567 40.758092 1
4 2010-03-09 07:51:00.000000135 5.3 2010-03-09 07:51:00 UTC -73.968095 40.768008 -73.956655 40.783762 1
5 2011-01-06 09:50:45.0000002 12.1 2011-01-06 09:50:45 UTC -74.000964 40.731630 -73.972892 40.758233 1
6 2012-11-20 20:35:00.0000001 7.5 2012-11-20 20:35:00 UTC -73.980002 40.751662 -73.973802 40.764842 1
7 2012-01-04 17:22:00.00000081 16.5 2012-01-04 17:22:00 UTC -73.951300 40.774138 -73.990095 40.751048 1
8 2012-12-03 13:10:00.000000125 9.0 2012-12-03 13:10:00 UTC -74.006462 40.726713 -73.993078 40.731628 1
9 2009-09-02 01:11:00.00000083 8.9 2009-09-02 01:11:00 UTC -73.980658 40.733873 -73.991540 40.758138 2
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   key                100000 non-null  object 
 1   fare_amount        100000 non-null  float64
 2   pickup_datetime    100000 non-null  object 
 3   pickup_longitude   100000 non-null  float64
 4   pickup_latitude    100000 non-null  float64
 5   dropoff_longitude  100000 non-null  float64
 6   dropoff_latitude   100000 non-null  float64
 7   passenger_count    100000 non-null  int64  
dtypes: float64(5), int64(1), object(2)
memory usage: 6.1+ MB
In [4]:
df.describe()
Out[4]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
count 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000 100000.000000
mean 11.354652 -72.494682 39.914481 -72.490967 39.919053 1.673820
std 9.716777 10.693934 6.225686 10.471386 6.213427 1.300171
min -44.900000 -736.550000 -74.007670 -84.654241 -74.006377 0.000000
25% 6.000000 -73.992041 40.734996 -73.991215 40.734182 1.000000
50% 8.500000 -73.981789 40.752765 -73.980000 40.753243 1.000000
75% 12.500000 -73.966982 40.767258 -73.963433 40.768166 2.000000
max 200.000000 40.787575 401.083332 40.851027 404.616667 6.000000
In [5]:
df.isnull().sum()
Out[5]:
key                  0
fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
dtype: int64

2. Feature Engineering¶

In [6]:
df_copy = df.copy()
## A little bit of cleaning to drop duplicates and nulls before starting the analysis
df_copy = df_copy.dropna()
df_copy = df_copy.drop_duplicates()

df_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   key                100000 non-null  object 
 1   fare_amount        100000 non-null  float64
 2   pickup_datetime    100000 non-null  object 
 3   pickup_longitude   100000 non-null  float64
 4   pickup_latitude    100000 non-null  float64
 5   dropoff_longitude  100000 non-null  float64
 6   dropoff_latitude   100000 non-null  float64
 7   passenger_count    100000 non-null  int64  
dtypes: float64(5), int64(1), object(2)
memory usage: 6.1+ MB
In [7]:
df_copy['pickup_datetime'] = pd.to_datetime(df_copy['pickup_datetime'], format= "%Y-%m-%d %H:%M:%S UTC")

# Extract time information from pickup_datetime (the .dt accessor is vectorized,
# so it is faster than applying a lambda row by row)
df_copy['year'] = df_copy.pickup_datetime.dt.year
df_copy['weekday'] = df_copy.pickup_datetime.dt.weekday
df_copy['hour'] = df_copy.pickup_datetime.dt.hour
In [8]:
def distance(lat1, lon1, lat2, lon2):
    '''Receive latitude and longitude coordinates and calculate the distance
    between the points using the Haversine formula.
    float, float, float, float --> float (miles)'''

    p = 0.017453292519943295 # pi/180, degrees to radians
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2

    # 12742 km is the Earth's diameter; 0.6213712 converts km to miles
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))
In [9]:
df_copy['distance'] = distance(df_copy.pickup_latitude, df_copy.pickup_longitude, 
                                    df_copy.dropoff_latitude, df_copy.dropoff_longitude)
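As a quick sanity check of the Haversine helper (reimplemented below so the snippet runs standalone): one degree of latitude is about 69.1 miles anywhere on Earth, and the formula should reproduce that.

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    # Same Haversine formula as above: 12742 km is the Earth's diameter,
    # 0.6213712 converts km to miles.
    p = 0.017453292519943295  # pi/180
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2)
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))

# One degree of latitude along a meridian is roughly 69.1 miles:
d = haversine_miles(40.0, -74.0, 41.0, -74.0)
print(round(d, 1))  # ≈ 69.1
```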

3. Data Cleaning¶

Note that we have negative values in fare_amount, so we need to exclude them. The columns 'key' and 'pickup_datetime' are no longer necessary because of our new columns.¶

In [10]:
df_copy = df_copy[df_copy.fare_amount > 0]
df_copy = df_copy[df_copy.distance > 0]
df_copy = df_copy.drop(['key', 'pickup_datetime'], axis=1)
In [11]:
df_copy.head()
Out[11]:
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count year weekday hour distance
0 4.5 -73.844311 40.721319 -73.841610 40.712278 1 2009 0 17 0.640487
1 16.9 -74.016048 40.711303 -73.979268 40.782004 1 2010 1 16 5.250670
2 5.7 -73.982738 40.761270 -73.991242 40.750562 2 2011 3 0 0.863411
3 7.7 -73.987130 40.733143 -73.991567 40.758092 1 2012 5 4 1.739386
4 5.3 -73.968095 40.768008 -73.956655 40.783762 1 2010 1 7 1.242218

4. Data Visualisation¶

Visualising the geospatial locations for the pickup points.¶

In [12]:
from folium.plugins import HeatMap
# folium map created
m = folium.Map(location=[40.75, -73.98], zoom_start=11, tiles='CartoDB positron')

# creating the heat layer
heat_data = [[row['pickup_latitude'], row['pickup_longitude'], row['fare_amount']] for _, row in df_copy.iterrows()]
HeatMap(heat_data, radius=15, max_zoom=13).add_to(m)

m
Out[12]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Visualising the geospatial locations for the dropoff points¶

In [13]:
m_dropoff = folium.Map(location=[40.75, -73.98], zoom_start=11, tiles='CartoDB positron')

heat_data_dropoff = [[row['dropoff_latitude'], row['dropoff_longitude'], row['fare_amount']] for _, row in df_copy.iterrows()]
HeatMap(heat_data_dropoff, radius=15, max_zoom=13).add_to(m_dropoff)

m_dropoff
Out[13]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Histogram plot of fare price¶

In [93]:
plt.style.use('ggplot')
plt.figure(figsize=(12, 5))

plt.hist(df_copy['fare_amount'], bins=100, color='skyblue')
plt.xlabel("Fare ($)")
plt.ylabel("Amount")
plt.title("Histogram of Fare ($)")
plt.show()
No description has been provided for this image

Bar plot of the number of rides per year¶

In [91]:
year_counts = df_copy['year'].value_counts()

plt.figure(figsize=(15, 4))
plt.bar(year_counts.index, year_counts.values, color=palette)
plt.ylabel("Ride Count")
plt.xlabel("Year")
plt.title("Annual Ride Distribution")
plt.show()
No description has been provided for this image

Traffic in the year 2012¶

In [16]:
year2012_insight = df_copy[df_copy['year'] == 2012]
In [89]:
xlim = [-74.03, -73.85]
ylim = [40.70, 40.85]

year2012_traffic_insight = year2012_insight.copy()

year2012_insight = year2012_insight.query(
    "pickup_longitude > @xlim[0] and pickup_longitude < @xlim[1] and "
    "dropoff_longitude > @xlim[0] and dropoff_longitude < @xlim[1] and "
    "pickup_latitude > @ylim[0] and pickup_latitude < @ylim[1] and "
    "dropoff_latitude > @ylim[0] and dropoff_latitude < @ylim[1]"
)
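The bounding-box filter above relies on pandas `query()` referencing local variables with `@`, including indexing into them. A minimal sketch on a toy frame (the column names `lon`/`lat` are hypothetical):

```python
import pandas as pd

# Toy frame: two points inside the window, one outside it
pts = pd.DataFrame({"lon": [-74.0, -73.9, -73.5],
                    "lat": [40.75, 40.80, 40.60]})
xlim = [-74.03, -73.85]
ylim = [40.70, 40.85]

# @xlim[0] etc. reference (and index into) the local lists above
inside = pts.query("lon > @xlim[0] and lon < @xlim[1] and "
                   "lat > @ylim[0] and lat < @ylim[1]")
print(len(inside))  # 2
```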
In [88]:
fig, axes = plt.subplots(1, 2, figsize=(15, 7))

axes[0].plot(year2012_insight.dropoff_longitude, year2012_insight.dropoff_latitude, 'o', alpha = .5, markersize = 2, color="#fff", markeredgecolor='#000', markeredgewidth=1.5)
axes[0].plot(year2012_insight.pickup_longitude, year2012_insight.pickup_latitude, '.', alpha = .8, markersize = .5, color="red")
axes[0].legend(['Dropoff Points', "Pickup Points"])
axes[0].set_xlabel("\nTraffic in the Year 2012 \n(Black --> Dropoff Points, Red --> Pickup Points)")
axes[0].grid(False)

days_list = {'monday' : 0, 'tuesday' : 1, 'wednesday' : 2, 'thursday' : 3, 'friday' : 4, 'saturday' : 5, 'sunday' : 6}
# sort_index() aligns the counts with the Monday..Sunday label order
weeklyTraffic = year2012_insight['weekday'].value_counts().sort_index()
axes[1].pie(weeklyTraffic.values, labels=list(days_list), autopct="%.2f%%", explode=[0.1, 0.1, 0.1, 0, 0, 0, 0], colors=palette)
axes[1].set_xlabel("\nWeekday distribution of rides in 2012")
plt.show()
No description has been provided for this image
In [19]:
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
In [20]:
loc_df = pd.DataFrame()
loc_df['longitude'] = year2012_insight.dropoff_longitude
loc_df['latitude'] = year2012_insight.dropoff_latitude

kmeans = KMeans(n_clusters=15, random_state=2, n_init = 10).fit(loc_df)
loc_df['label'] = kmeans.labels_

plt.figure(figsize = (10, 10))
for label in loc_df.label.unique():
    plt.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 1, markersize = 0.8)

plt.title('Clusters of New York of year 2012')
plt.show()
No description has been provided for this image
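A fitted `KMeans` model can also assign new coordinates to clusters via `predict`. A minimal sketch on synthetic points (the two centres are made-up NYC-like coordinates, and the cluster ids themselves are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic "neighbourhoods" of (lon, lat) points
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal([-73.98, 40.75], 0.01, size=(50, 2)),
                 rng.normal([-73.87, 40.77], 0.01, size=(50, 2))])
km = KMeans(n_clusters=2, random_state=2, n_init=10).fit(pts)

# Points near the two different centres should land in different clusters
label_a = km.predict([[-73.98, 40.75]])[0]
label_b = km.predict([[-73.87, 40.77]])[0]
print(label_a != label_b)  # True
```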

Let's see the peak days and their rush hours in 2012¶

In [21]:
def visualize_peakDaysF(day, color='r'):
    day_insight = year2012_insight[year2012_insight["weekday"] == day]
    day_name = list(days_list.keys())[day]
    plt.figure(figsize = (15, 70))

    max_pickup, max_pgcnt = 0, 0

    for hrs in range(24):
        specDay_traffic = day_insight[day_insight['hour'] == hrs]
        pickup = len(specDay_traffic)
        pgn_cnt = specDay_traffic["passenger_count"].sum()
        
        
        max_pickup = max(max_pickup, pickup)
        max_pgcnt = max(max_pgcnt, pgn_cnt)


        longitude = list(specDay_traffic.pickup_longitude) + list(specDay_traffic.dropoff_longitude)
        latitude = list(specDay_traffic.pickup_latitude) + list(specDay_traffic.dropoff_latitude)
        plt.subplot(24, 6, hrs+1)

        plt.title("\nHour: " + str(hrs) + " [pickup="+ str(pickup)+",\npassengers count="+ str(pgn_cnt)+"] ", fontsize=12)

        plt.grid(False)
        plt.xticks([])
        plt.yticks([])
        plt.plot(longitude,latitude,'.', alpha = 0.6, markersize = 10, color=color)

    plt.suptitle("\n"+ day_name.capitalize() +" (max pickups=" + str(max_pickup) + ", max passengers=" + str(max_pgcnt) + ")\n\n\n\n\n\n", fontsize=20)
    plt.tight_layout()
    plt.show()

Visualize the rush hours for Monday¶

In [22]:
visualize_peakDaysF(0, color='#4856fb')
No description has been provided for this image

Visualize the rush hours for Tuesday¶

In [23]:
visualize_peakDaysF(1, color='#10a2f0')
No description has been provided for this image

Visualize the rush hours for Sunday¶

Now, let's see the rush hours of Sunday, the day that generates the lowest traffic in 2012

In [24]:
visualize_peakDaysF(6, color='#ffa256')
No description has been provided for this image

Histogram plot of the distances travelled¶

In [29]:
year2012_insight.distance.hist(bins=30, figsize=(15,4), color='#20beff')
plt.xlabel("Distance (miles)")
plt.title("Histogram of distances")
plt.show()
No description has been provided for this image

This histogram shows that most of the rides taken were short rides

In [31]:
year2012_insight.groupby('passenger_count')[['distance', 'fare_amount']].mean()
Out[31]:
distance fare_amount
passenger_count
0 1.719203 8.408661
1 1.716163 9.625725
2 1.746542 9.949899
3 1.736295 9.793232
4 1.916772 10.491786
5 1.744920 9.582713
6 1.775214 10.102332
In [32]:
print("Average $USD/Mile : {:0.2f}".format(year2012_insight.fare_amount.sum()/year2012_insight.distance.sum()))
Average $USD/Mile : 5.61

Scatter plot visualization between Fare(in $USD) vs Distance(in Miles) of year 2012¶

In [44]:
plt.figure(figsize=(15,4))

plt.scatter(year2012_insight.fare_amount, year2012_insight.distance, c=year2012_insight.fare_amount, 
            cmap=plt.cm.rainbow, alpha=0.8, s=30, marker=".")
plt.xlabel("Fare(in $USD)")
plt.ylabel("Distance(in Miles)")
plt.title("Scatter plot Fare(in $USD) vs Distance(in Miles)\n")
plt.xlim(0,60)
plt.grid(False)
plt.colorbar()
plt.show()
No description has been provided for this image

Looking at this data, we can say:

  1. Some trips have zero distance but a fare greater than zero. Maybe these are trips that started and ended in the same place? It will be hard to predict these fares because we don't have enough information in the dataset.

  2. In general, there seems to be a (linear) relationship between distance and fare.

  3. Most rides have an initial charge of $2.50 when you start.

  4. It also looks like someone paid a lot more than usual (> $120).

Note: The distance in the dataset is calculated in a straight line (point to point). In real life, the road distance is longer.
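The initial charge shows up as the intercept of a linear fare-vs-distance fit. A minimal sketch on synthetic fares (the $2.50/mile rate and noise level here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
miles = rng.uniform(0.5, 10, 500)
# Hypothetical tariff: $2.50 base charge plus $2.50 per mile, with noise
fares = 2.50 + 2.50 * miles + rng.normal(0, 0.5, 500)

# np.polyfit returns coefficients from the highest degree down
slope, intercept = np.polyfit(miles, fares, 1)
print(round(intercept, 1))  # the fitted intercept recovers the base charge, ≈ 2.5
```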

In [78]:
# removing datapoints with distance < 0.05 miles
print(f"Original size: {len(year2012_insight)}")
train_df = year2012_insight.query("distance >= 0.05")

print(f"New size: {len(train_df)}")
Original size: 14225
New size: 14149

5. Model (for the year 2012 only)¶

Based on the analysis we can build baseline models: we will try Linear Regression, XGBoost regression, Decision Tree regression, Random Forest regression, and LightGBM.

In [46]:
model_data = train_df[['year', 'hour', 'distance', 'passenger_count', 'fare_amount']]
In [47]:
model_data.head()
Out[47]:
year hour distance passenger_count fare_amount
3 2012 4 1.739386 1 7.7
6 2012 20 0.966733 1 7.5
7 2012 17 2.582073 1 16.5
8 2012 13 0.778722 1 9.0
10 2012 7 0.854123 1 5.3

Train, test split of the datasets¶

In [48]:
X = model_data[['year', 'hour', 'distance', 'passenger_count']]
y = model_data[['fare_amount']]
In [49]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Linear Regression Run¶

In [50]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

model_lin = Pipeline((
        ("standard_scaler", StandardScaler()),
        ("lin_reg", LinearRegression()),
    ))
model_lin.fit(X_train, y_train)
Out[50]:
Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('lin_reg', LinearRegression())])
In [51]:
from sklearn.metrics import r2_score

y_test_pred = model_lin.predict(X_test)
score = r2_score(y_test, y_test_pred)
print("The R^2 score of our model on the test set is {}%".format(round(score, 2) * 100))
The R^2 score of our model on the test set is 78.0%
In [53]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model_lin, X_train, y_train, cv=5, scoring='r2') 

print("Cross-validation scores (R^2):", cv_scores)
print("Mean R^2:", cv_scores.mean())
print("Standard deviation of R^2:", cv_scores.std())
Cross-validation scores (R^2): [0.72157324 0.7675404  0.75781654 0.69141379 0.72282491]
Mean R^2: 0.7322337751726523
Standard deviation of R^2: 0.027457171287698062
In [76]:
## This function automates the evaluation of the models

def evaluate_models(model):
    y_test_pred = model.predict(X_test)
    score = r2_score(y_test, y_test_pred)
    print("The R^2 score of our model on the test set is {}%".format(round(score, 2) * 100))

    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2') 

    print("Cross-validation scores (R^2):", cv_scores)
    print("Mean R^2:", cv_scores.mean())
    print("Standard deviation of R^2:", cv_scores.std())

XGBoost Run¶

In [56]:
from xgboost import XGBRegressor

model_xgb = Pipeline((
        ("standard_scaler", StandardScaler()),  
        ("xgb_reg", XGBRegressor(objective='reg:squarederror', random_state=42)),  
    ))

model_xgb.fit(X_train, y_train)
Out[56]:
Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('xgb_reg',
                 XGBRegressor(base_score=None, booster=None, callbacks=None,
                              colsample_bylevel=None, colsample_bynode=None,
                              colsample_bytree=None, device=None,
                              early_stopping_rounds=None,
                              enable_categorical=False, eval_metric=None,
                              feature_types=None, gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=None, n_jobs=None,
                              num_parallel_tree=None, random_state=42, ...))])
In [71]:
evaluate_models(model_xgb)
The R^2 score of our model on the test set is 74.0%
Cross-validation scores (R^2): [0.67019365 0.74424673 0.67675671 0.70065511 0.6861153 ]
Mean R^2: 0.695593499335429
Standard deviation of R^2: 0.026391554233688042

Decision Tree Regression¶

In [60]:
from sklearn.tree import DecisionTreeRegressor

model_tree = Pipeline((
        ("standard_scaler", StandardScaler()),  
        ("tree_reg", DecisionTreeRegressor(random_state=42)),  
    ))

model_tree.fit(X_train, y_train)
Out[60]:
Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('tree_reg', DecisionTreeRegressor(random_state=42))])
In [72]:
evaluate_models(model_tree)
The R^2 score of our model on the test set is 61.0%
Cross-validation scores (R^2): [0.42052636 0.58426184 0.41315673 0.52118714 0.50310667]
Mean R^2: 0.4884477476040424
Standard deviation of R^2: 0.06441916539853423

Random Forest Regression¶

In [63]:
from sklearn.ensemble import RandomForestRegressor

model_rf = Pipeline((
        ("standard_scaler", StandardScaler()),  
        ("rf_reg", RandomForestRegressor(random_state=42, n_estimators=100)),  
    ))

model_rf.fit(X_train, y_train)
Out[63]:
Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('rf_reg', RandomForestRegressor(random_state=42))])
In [73]:
evaluate_models(model_rf)
The R^2 score of our model on the test set is 75.0%
Cross-validation scores (R^2): [0.67489505 0.73322172 0.71141027 0.68725541 0.70125229]
Mean R^2: 0.7016069482950061
Standard deviation of R^2: 0.02007593953599224

LightGBM regressor¶

In [67]:
from lightgbm import LGBMRegressor

model_lgbm = Pipeline((
        ("standard_scaler", StandardScaler()),
        ("lgbm_reg", LGBMRegressor(random_state=42)),
    ))

model_lgbm.fit(X_train, y_train)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000436 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 288
[LightGBM] [Info] Number of data points in the train set: 10611, number of used features: 3
[LightGBM] [Info] Start training from score 9.622901
Out[67]:
Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('lgbm_reg', LGBMRegressor(random_state=42))])
In [75]:
evaluate_models(model_lgbm)
The R^2 score of our model on the test set is 78.0%
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000177 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 288
[LightGBM] [Info] Number of data points in the train set: 8488, number of used features: 3
[LightGBM] [Info] Start training from score 9.630914
Cross-validation scores (R^2): [0.71743692 0.77811075 0.75262279 0.71828627 0.7383625 ]
Mean R^2: 0.7409638475739545
Standard deviation of R^2: 0.022761280336944113

6. Conclusion¶

After analyzing the data we constructed several models to predict the NYC taxi fare price from a few variables. The best prediction results came from the Linear Regression model and LightGBM. The results of these two models are almost the same, and both are light and don't use a lot of computational resources.
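For a side-by-side view, the mean cross-validated R² values reported in the cells above can be collected into a small table (the numbers are copied from this notebook's runs):

```python
import pandas as pd

# Mean 5-fold CV R^2 scores as reported in the evaluation cells above
results = pd.DataFrame({
    "model": ["Linear Regression", "XGBoost", "Decision Tree",
              "Random Forest", "LightGBM"],
    "mean_cv_r2": [0.7322, 0.6956, 0.4884, 0.7016, 0.7410],
}).sort_values("mean_cv_r2", ascending=False)

print(results.to_string(index=False))  # LightGBM and Linear Regression lead
```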